feat: idoctags serialization and deserialization matching the iso proposal #457
Merged
PeterStaar-IBM merged 29 commits into main on Dec 17, 2025
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updates
Wonderful, this rule succeeded.
When test data is updated, we require two reviewers
✅ DCO Check Passed
Thanks @PeterStaar-IBM, all your commits are properly signed off. 🎉
Codecov Report
❌ Patch coverage is …
…cTags Signed-off-by: Peter Staar <taa@zurich.ibm.com>
vagenas reviewed on Dec 16, 2025
cau-git reviewed on Dec 17, 2025
examples/convert_to_idoctags.py
Outdated
Member
Let's remove this comment since the dataset is not public.
vagenas previously approved these changes on Dec 17, 2025
cau-git approved these changes on Dec 17, 2025
dolfim-ibm approved these changes on Dec 17, 2025
IDocTags Serialization Implementation
Overview
Implements bidirectional serialization between DoclingDocument and the IDocTags format, a specialized XML-based markup language for structured document representation with geometric and semantic annotations.
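To make the round-trip idea concrete, here is a minimal toy sketch using only the Python standard library. It does not use the docling-core API or the real IDocTags schema: the `TextItem` class, the `doc` root element, and the `bbox` attribute are all hypothetical names chosen for illustration. It only shows the property a bidirectional serializer is expected to satisfy, namely that deserializing the serialized output gives back the original items.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextItem:
    """Hypothetical stand-in for a document item; not a docling-core class."""

    label: str  # semantic annotation, e.g. "heading" or "text"
    text: str
    bbox: Tuple[int, int, int, int]  # geometric annotation: left, top, right, bottom


def serialize(items: List[TextItem]) -> str:
    """Write items as XML. Tag and attribute names are illustrative only."""
    root = ET.Element("doc")
    for item in items:
        node = ET.SubElement(root, item.label)
        node.set("bbox", ",".join(str(v) for v in item.bbox))
        node.text = item.text
    return ET.tostring(root, encoding="unicode")


def deserialize(markup: str) -> List[TextItem]:
    """Rebuild the items from the markup produced by serialize()."""
    items = []
    for node in ET.fromstring(markup):
        left, top, right, bottom = (int(v) for v in node.get("bbox").split(","))
        items.append(TextItem(node.tag, node.text or "", (left, top, right, bottom)))
    return items


original = [
    TextItem("heading", "Intro", (10, 20, 200, 40)),
    TextItem("text", "Hello & goodbye", (10, 50, 200, 90)),
]
# Round-trip check: the deserialized items equal the originals.
assert deserialize(serialize(original)) == original
```

A round-trip assertion of this shape is a common way to test a serializer and deserializer pair against each other.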
Serialization Features
Core Capabilities:
Current Test Coverage
Outstanding Work (with FIXMEs)
… <thread id="int"> and page-breaks. This will likely need some updates to the BaseSerializer, so I would rather not include it in this PR.
… ./test/test_deserializer_idoctags.py that do not pass. Currently they are skipped, but we need to make them work.
Testing
Dump Mode Usage
Serialize DoclingDocuments from HuggingFace datasets to IDocTags format and generate a validation report:
What it does:
Config file:
If --config is omitted, a default config (idoctags_dump_config.json) is auto-generated. Key settings: dataset_name, dataset_subset, output_dir, report_path, limit.
Use --write-default-config to generate the config template without running the dump.
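The config behaviour described above could look roughly like the sketch below. The flag names, the default file name, and the five keys come from the description; the default values, the function names (`build_parser`, `write_default_config`, `load_or_create_config`), and the overall structure are assumptions for illustration, not the script's actual implementation.

```python
import argparse
import json
from pathlib import Path
from typing import Optional

DEFAULT_CONFIG_PATH = Path("idoctags_dump_config.json")

# Key names follow the PR description; every value is a placeholder assumption.
DEFAULT_CONFIG = {
    "dataset_name": "org/dataset",
    "dataset_subset": None,
    "output_dir": "idoctags_dump",
    "report_path": "idoctags_report.json",
    "limit": 10,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Dump documents to IDocTags (sketch).")
    parser.add_argument("--config", type=Path, default=None,
                        help="Path to a JSON config file.")
    parser.add_argument("--write-default-config", action="store_true",
                        help="Write the config template and exit without dumping.")
    return parser


def write_default_config(path: Path = DEFAULT_CONFIG_PATH) -> Path:
    """Write the config template to disk and return its path."""
    path.write_text(json.dumps(DEFAULT_CONFIG, indent=2))
    return path


def load_or_create_config(path: Optional[Path],
                          default_path: Path = DEFAULT_CONFIG_PATH) -> dict:
    """Load the given config; if none is given, auto-generate the default one."""
    if path is None:
        path = default_path
        if not path.exists():
            write_default_config(path)
    return json.loads(path.read_text())
```

With this shape, a `main` function would parse the arguments, call `write_default_config()` and return early when `--write-default-config` is set, and otherwise pass the result of `load_or_create_config(args.config)` to the dump routine.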
The result of,
is